Abstract

Given a set of pairwise comparisons, the classical ranking problem computes a single ranking that best 
represents the preferences of all users. In this paper, we study the problem of inferring individual preferences, 
arising in the context of making personalized recommendations. In particular, we assume that there are n users 
of r types; users of the same type provide similar pairwise comparisons for m items according to the Bradley- 
Terry model. We propose an efficient algorithm that accurately estimates the individual preferences for almost 
all users, if there are $r \max\{m, n\} \log m \log^2 n$ pairwise comparisons per type, which is near optimal in sample
complexity when r only grows logarithmically with m or n. Our algorithm has three steps: first, for each user,
compute the net-win vector which is a projection of its $\binom{m}{2}$-dimensional vector of pairwise comparisons onto an
m-dimensional linear subspace; second, cluster the users based on the net-win vectors; third, estimate a single 
preference for each cluster separately. The net-win vectors are much less noisy than the high dimensional vectors 
of pairwise comparisons and clustering is more accurate after the projection as confirmed by numerical experiments. 
Moreover, we show that, when a cluster is only approximately correct, the maximum likelihood estimation for the 
Bradley-Terry model is still close to the true preference. 


I. Introduction 

The question of ranking items using pairwise comparisons is of interest in many applications. Some 
typical examples are from sports where pairs of players play against each other and people are interested 
in ranking the players from past games. This type of ranking problem is usually studied using the
Bradley-Terry model [4] where each item $i$ is associated with a score $\theta_i$ measuring its competitiveness and

$$P[\text{item } i \text{ is preferred over item } j] = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}.$$

For this model, the maximum likelihood estimation of the score vector $\theta$ can be solved efficiently [15].

There are other examples where comparisons are obtained implicitly. For example, when a user clicks 
a result from a list returned by a search engine for a given request, it implies that this user prefers 
this result over nearby results on the list. Similarly, when a customer buys a product from an online 
retailer, it implies that this customer prefers this product over previously browsed products. Businesses 
providing these services are interested in inferring users’ rankings of items. In these examples, users can 
have different scores for the same item and a single score vector is insufficient to capture individual 
preferences. Therefore it is more appropriate to view the user preferences as generated from the mixture 
of Bradley-Terry models. Though this mixture model has been used in many fields (see [1], [23] and
the references therein), little is known about how to cluster the users and learn the individual preferences
efficiently, and how many pairwise comparisons are needed for a target estimation error.

In this work, we study the following mixture Bradley-Terry model: users are clustered into different 
types; users of the same type have the same score vector; every user independently generates a few 
pairwise comparisons according to the Bradley-Terry model. Notice that under our model users of the 
same type will have similar but not necessarily identical pairwise comparisons. The task is to estimate 
the score vector for each user. Essentially, we would like to cluster the users using the observed pairwise 
comparisons and then estimate the score vector for each cluster. However, there are two key challenges. 
First, for each user, if we stack all the possible pairwise comparisons as a vector, this comparison vector 
lies in a high dimensional space and only a small number of its entries are observed. Hence, directly
clustering users based on the comparison vectors is likely to be too noisy to work well; our numerical
experiments (see Section VII) confirm as much. Second, although standard algorithms like maximum
likelihood estimation [15], [22] are available for estimating the score vector once the clusters (users of
the same type) are exactly found, it is still unclear how the algorithms perform when the clusters are only 
approximately recovered. 

Our first contribution is to propose and show the effectiveness of clustering users according to their 
net-win vectors. A net-win vector for a user is a vector of length $m$, where its $i$-th coordinate counts
the number of times item $i$ is preferred over other items minus the number of times other items are
preferred over item $i$ according to this user's pairwise comparisons. The effectiveness of net-win vectors
in clustering users relies on the following surprising fact: the means of all the comparison vectors are
close to some $(m-1)$-dimensional linear subspace; the net-win vectors are essentially the projection of
the comparison vectors onto this low-dimensional linear subspace. We show the projection to the net-win
vectors preserves the distances between different clusters but the net-win vectors are much less noisy than 
the comparison vectors. Given good separations of the net-win vectors corresponding to different clusters, 
we show that a standard spectral clustering algorithm approximately recovers the user clusters. 

Our second contribution is to show that, even though the clusters have a few erroneously assigned users, 
the maximum likelihood estimator for the Bradley-Terry model is still close to the true score vector for 
this cluster. In our algorithm, as we only expect to approximately recover the user clusters, this robustness 
result ensures that we can still approximately recover the score vectors for most users. 

The results for the clustering and estimation steps can be combined to provide a performance guarantee 
for the overall algorithm. Our algorithm accurately estimates the score vectors for most users with only 
$O(r \max\{m, n\} \log m \log^2 n)$ pairwise comparisons per cluster, where $r$ is the number of user types
(clusters) and $n$ is the total number of users. When there is only one cluster, it is known that $\Omega(m)$
pairwise comparisons are required for any algorithm to accurately estimate the score vector [22]. Also,
each user needs to provide at least one pairwise comparison; otherwise there is no hope of accurately
estimating the preference for this user, so at least $n$ pairwise comparisons are needed in total to learn
individual preferences. When $r$ is of order $\log m$ or $\log n$, the sample complexity of our algorithm matches
the lower bounds up to logarithmic factors.


A. Related Work 

In this section, we point out some connections of our model and results to prior work. There is a vast 
literature on the ranking and related rating prediction problems; here we cover a fraction of it we see as 
most relevant. 

Rank aggregation has been extensively studied across various disciplines including statistics, psychology, 
sociology, and computer science [6], [9], [14], [16], [27]. The Bradley-Terry (BT) model proposed in [5],
[19] and its various extensions are widely used for studying the rank aggregation problem [2], [12], [15],
[21], [24], [25]. The classical results in [15] show that the likelihood function under the BT model is
concave and the ML estimator for the score vector can be efficiently found using an EM algorithm. It
is further shown in [22] and [13] that $\Omega(m)$ pairwise comparisons are necessary for any algorithm to
accurately infer the score vector and $O(m \log m)$ randomly chosen pairwise comparisons are sufficient for
the ML estimator. In this paper, we show the ML estimator is able to estimate the score vector accurately
even with a small number of arbitrarily corrupted pairwise comparisons. In addition to the ML estimation,
several Markov chain based iterative methods have been proposed in [10], [22], and have been shown
to accurately estimate the score vector with $\Omega(m \log m)$ randomly chosen pairwise comparisons, which
matches the sample complexity of the ML estimator.

Previous works on rank aggregation, however, mostly focus on a single type of users and aim to 
combine the observed user preferences to output a single ranking that best represents the preferences of 
all users. Little is known about clustering and learning individual preferences when there are multiple 
types of users. In this paper, we consider a mixed Bradley-Terry model to capture the heterogeneity of
the user preferences. Our mixed BT model is closely related to the so-called mixed multinomial logit
model studied in [1] and [23]. Ammar et al. [1] study a clustering problem similar to ours under the
mixed multinomial logit model, where each user provides a set of $\ell$ favorite items instead of pairwise
comparisons, and users are clustered based on the overlaps between the sets of $\ell$ favorite items. Under
a geometric decay condition on the score vectors, the algorithm is shown to cluster users correctly with
high probability if $\ell = \Omega(\log m)$ and there are only $\mathrm{poly}(\log m)$ users. However, it is unclear whether
the geometric decay condition holds in practice, and more importantly how the algorithm performs when
there are a large number of users. Oh and Shah [23] study a different problem of estimating the model
parameters under the mixed multinomial logit model. A tensor decomposition based algorithm is shown to
estimate the model parameters accurately with $r^{3.5} m^3\, \mathrm{poly}(\log m)$ pairwise comparisons per component.
If our algorithm is applied to estimate the model parameters, only $r m\, \mathrm{poly}(\log m)$ pairwise comparisons
per component are needed.¹ Another mixture approach is proposed in [7] for clustering heterogeneous
ranking data and an efficient EM algorithm is derived for parameter estimation. This method can take
rankings of different lengths as input. However, no analytical performance guarantee is provided for the
clustering. Very recently, a nuclear norm regularization approach was proposed in [18] to estimate the score
vectors for all users. By assuming each user has a unique score vector and the score matrix $\Theta^*$ formed
by stacking all score vectors as rows is approximately of low rank $r$, they prove the estimation error
$\|\hat{\Theta} - \Theta^*\|_F = o(n)$ if there are $r \max\{m, n\} \log(\max\{m, n\})$ randomly chosen pairwise comparisons.
However, it is not immediately clear how the nuclear norm approach performs in terms of the estimation
error of the score vector for each individual.

Finally, we point out that there is a large body of work studying the related problem of rating predictions. 
A popular approach is based on matrix completion methods [8], [17], where the incomplete rating matrix is
assumed to be of low rank. Another line of work [3], [20], [28] assumes there are multiple types of
users and users of the same type provide similar ratings. However, the rating based methods have several
limitations compared to the pairwise preference based methods. First, not all preferences are available
in the form of ratings, while numerical ratings can be transformed into pairwise comparisons. Second,
ratings are user-biased, e.g., a user may give higher or lower ratings on average than others, while pairwise
comparisons are absolute. Third, pairwise comparisons are more reliable and consistent than ratings, e.g., it
is easier for a user to compare two items than to assign scores to them. Algorithmically, learning preferences
from rankings is more challenging, because the vectors of pairwise comparisons lie in a $\binom{m}{2}$-dimensional
space, while the vectors of ratings lie in an $m$-dimensional space. We overcome this challenge by a simple,
but non-trivial projection of the comparison vectors onto a low dimensional linear subspace.


II. Problem setup 

Consider a system with $m$ items and $r$ user clusters, each of size $K$, and let $n = rK$. Each user $u$ has a
score vector for the items $\theta_u = (\theta_{u,1}, \ldots, \theta_{u,m})$, and he/she compares items according to the
Bradley-Terry model: item $i$ is preferred over item $j$ with probability
$\frac{e^{\theta_{u,i}}}{e^{\theta_{u,i}} + e^{\theta_{u,j}}}$ and vice versa with probability
$\frac{e^{\theta_{u,j}}}{e^{\theta_{u,i}} + e^{\theta_{u,j}}}$. Assume users in the same cluster have the same score vector and denote the common score
vector for cluster $k$ by $\theta_k$. As $(\theta_{k,1}, \ldots, \theta_{k,m})$ and $(\theta_{k,1} + C, \ldots, \theta_{k,m} + C)$ for any constant $C$ define the
same probability distributions of pairwise comparisons in the Bradley-Terry model, $\theta_k$ is only identifiable
up to a constant shift. To eliminate the ambiguity and without loss of generality, we always shift $\theta_k$ to
ensure that $\sum_{i} \theta_{k,i} = 0$.

¹ Since there is no need to estimate the preferences for every user in this context, the dependency on $n$ in our sample complexity can be
dropped.

The overall comparison result is represented by an $n \times \binom{m}{2}$ sample comparison matrix $R$. The $u$-th
row $R_u$ is the comparison vector of user $u$. The columns are indexed by two numbers $i, j = 1, \ldots, m$
with $i < j$, and the $ij$-th column corresponds to the comparisons for items $i$ and $j$. For each user $u$ and
items $i$ and $j$ with $i < j$, user $u$'s comparison result of items $i$ and $j$ is sampled with probability $1 - \epsilon$
independently, where $\epsilon$ is the erasure probability. Let $R_{u,ij} = 1$ if $u$ prefers $i$ over $j$, $R_{u,ij} = -1$ if $u$
prefers $j$ over $i$, and $R_{u,ij} = 0$ if $u$'s comparison is not sampled. Then

$$R_{u,ij} = \begin{cases} 1 & \text{w.p. } (1-\epsilon)\, \dfrac{e^{\theta_{u,i}}}{e^{\theta_{u,i}} + e^{\theta_{u,j}}}, \\ 0 & \text{w.p. } \epsilon, \\ -1 & \text{w.p. } (1-\epsilon)\, \dfrac{e^{\theta_{u,j}}}{e^{\theta_{u,i}} + e^{\theta_{u,j}}}. \end{cases}$$

Our goal is to estimate the score vectors $\theta_u$ from $R$.
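
To make the observation model concrete, the following is a minimal simulation sketch, not the authors' code. The helper names (`build_pairs`, `generate_scores`, `sample_comparisons`), the column ordering of the pairs, and the parameter values are assumptions chosen for illustration; scores are drawn uniformly in $[0, b]$ and centered, matching the simplifying assumption used for the analysis.

```python
import numpy as np

def build_pairs(m):
    """All item pairs (i, j) with i < j, in a fixed (assumed) column order."""
    return [(i, j) for i in range(m) for j in range(i + 1, m)]

def generate_scores(r, m, b, rng):
    """One score vector per cluster: uniform in [0, b], then centered to sum to zero."""
    raw = rng.uniform(0.0, b, size=(r, m))
    return raw - raw.mean(axis=1, keepdims=True)

def sample_comparisons(theta, eps, rng):
    """theta: (n, m) per-user scores. Returns R with entries in {-1, 0, +1}."""
    m = theta.shape[1]
    pairs = build_pairs(m)
    i_idx = np.array([p[0] for p in pairs])
    j_idx = np.array([p[1] for p in pairs])
    # Bradley-Terry: P[i beats j] = e^{t_i} / (e^{t_i} + e^{t_j}) = sigmoid(t_i - t_j).
    p_win = 1.0 / (1.0 + np.exp(-(theta[:, i_idx] - theta[:, j_idx])))
    outcome = np.where(rng.random(p_win.shape) < p_win, 1, -1)
    observed = rng.random(p_win.shape) >= eps  # each entry kept with probability 1 - eps
    return outcome * observed

rng = np.random.default_rng(0)
r, K, m, b, eps = 3, 40, 30, 3.0, 0.5
theta_clusters = generate_scores(r, m, b, rng)
labels = np.repeat(np.arange(r), K)          # true cluster of each of the n = rK users
R = sample_comparisons(theta_clusters[labels], eps, rng)
print(R.shape, (R != 0).mean())
```

Each row of `R` is one user's comparison vector; roughly a $1 - \epsilon$ fraction of its entries is non-zero.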

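To simplify the analysis, we will assume the $\theta_k$'s are generated independently as follows: for each $k$ and
$i$, generate $\theta^0_{k,i}$ i.i.d. uniformly in $[0, b]$, and then define

$$\theta_{k,i} = \theta^0_{k,i} - \frac{1}{m} \sum_{i'=1}^{m} \theta^0_{k,i'}.$$

Clearly, $\sum_i \theta_{k,i} = 0$ and $|\theta_{k,i} - \theta_{k,j}| \le b$ for any $k$, $i$ and $j$. Notice that the $\theta_{k,i}$ are not independent, and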


A. Notation and Outline 

Let $X = \sum_t \sigma_t u_t v_t^T$ denote the singular value decomposition of a matrix $X \in \mathbb{R}^{n \times n}$ such that
$\sigma_1 \ge \cdots \ge \sigma_n$. The spectral norm of $X$ is denoted by $\|X\|$, which is equal to the largest singular value.
The best rank $r$ approximation of $X$ is defined as $P_r(X) = \sum_{t=1}^{r} \sigma_t u_t v_t^T$. For vectors, let $\langle x, y \rangle$ denote
the inner product between two vectors; the only norm that will be used is the usual $\ell_2$ norm, denoted as
$\|x\|_2$. In this paper, all vectors are row vectors. We say an event occurs with high probability when the
probability of occurrence of that event goes to one as $m$ and $n$ go to infinity.



III. Algorithm and Main Result 

Our algorithm for clustering users and inferring their preferences is presented as Algorithm 1. The
basic idea is to estimate $\theta$ in two steps: cluster the users and then estimate a score vector for each cluster
separately.

The difficulty lies in the clustering step. Recall that, in our problem, each user is represented by a
comparison vector of length $\binom{m}{2}$, and only roughly $(1-\epsilon)\binom{m}{2}$ of its entries are observed. These comparison
vectors are so noisy that directly clustering them results in poor performance, a fact which we confirm in
our experiments in Section VII.


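We overcome this difficulty by reducing the dimension of the comparison vectors. Consider user $u$ with
comparison vector $R_u$. For each item $i$, define the normalized net number of wins

$$S_{u,i} = \frac{1}{\sqrt{m}} \sum_{j \ne i} \left( \mathbb{1}\{u \text{ prefers } i \text{ over } j\} - \mathbb{1}\{u \text{ prefers } j \text{ over } i\} \right) = \frac{1}{\sqrt{m}} \left( -\sum_{j<i} R_{u,ji} + \sum_{j>i} R_{u,ij} \right).$$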
Algorithm 1 Multi-Cluster Projected Ranking

Step 0: Sample splitting. Let $\Omega$ be the support of $R$, i.e., $\Omega = \{(u, ij) : R_{u,ij} \ne 0\}$. We construct two
sets $\Omega_1$ and $\Omega_2$ by independently assigning each element of $\Omega$ only to $\Omega_1$ with probability $(1+\epsilon)/4$,
only to $\Omega_2$ with probability $(1+\epsilon)/4$ and to both $\Omega_1$ and $\Omega_2$ with probability $(1-\epsilon)/4$. Define
$R^{(1)}_{u,ij} = R_{u,ij} \mathbb{1}\{(u,ij) \in \Omega_1\}$ and $R^{(2)}_{u,ij} = R_{u,ij} \mathbb{1}\{(u,ij) \in \Omega_2\}$.

Step 1: Denoising. Let $S = \frac{1}{\sqrt{m}} R^{(1)} A^T$. The $u$-th row of $S$ is the net-win vector of user $u$.

Step 2: User clustering. Let $\hat{S}$ be the rank $r$ approximation of $S$. Construct the clusters $\hat{C}_1, \ldots, \hat{C}_r$
sequentially. For $1 \le k \le r$, after $\hat{C}_1, \ldots, \hat{C}_{k-1}$ have been selected, choose an initial user $u$ not in the
first $k-1$ clusters uniformly at random, and let $\hat{C}_k = \{u' : \|\hat{S}_u - \hat{S}_{u'}\|_2 \le \tau\}$ where the threshold $\tau$ is
specified later. Assign each remaining unclustered user to a cluster arbitrarily.

Step 3: Score vector estimation. Let $D_{u,ij} = \mathbb{1}\{R^{(2)}_{u,ij} = 1\}$ and $D_{u,ji} = \mathbb{1}\{R^{(2)}_{u,ij} = -1\}$ for any $u$ and $i < j$.
For users in cluster $\hat{C}_k$, the estimated score vector is given by $\hat{\theta}_k = \arg\max_{\gamma} L_k(\gamma)$, where

$$L_k(\gamma) = \sum_{u \in \hat{C}_k,\, i, j} D_{u,ij} \log \frac{e^{\gamma_i}}{e^{\gamma_i} + e^{\gamma_j}}.$$


We call $S_u$ the net-win vector of user $u$. To simplify the notation, let $A \in \{\pm 1, 0\}^{m \times \binom{m}{2}}$ be the matrix
with the $ij$-th column being $e_i - e_j$, where $e_i$ is the length $m$ vector with all 0s except for a 1 in the $i$-th
coordinate; then it is easy to verify that

$$S_u = \frac{1}{\sqrt{m}} R_u A^T. \qquad (1)$$
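
The identity between the matrix form of the net-win vector and the direct count of wins minus losses can be checked numerically. The sketch below is illustrative only: `comparison_design` and `net_win_direct` are hypothetical helper names, and the column order of the pairs is an assumption (any fixed order works as long as $R_u$ and $A$ agree).

```python
import numpy as np

def comparison_design(m):
    """Matrix A in {+1, -1, 0}^{m x C(m,2)} whose ij-th column is e_i - e_j (i < j)."""
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    A = np.zeros((m, len(pairs)))
    for col, (i, j) in enumerate(pairs):
        A[i, col] = 1.0
        A[j, col] = -1.0
    return A, pairs

def net_win_direct(R_u, pairs, m):
    """Normalized (wins - losses) per item, counted one comparison at a time."""
    s = np.zeros(m)
    for col, (i, j) in enumerate(pairs):
        s[i] += R_u[col]   # +1 if i beat j, -1 if i lost to j, 0 if unobserved
        s[j] -= R_u[col]
    return s / np.sqrt(m)

m = 6
A, pairs = comparison_design(m)
rng = np.random.default_rng(1)
R_u = rng.choice([-1, 0, 1], size=len(pairs))   # an arbitrary comparison vector
S_u = R_u @ A.T / np.sqrt(m)                    # matrix form of the net-win vector
print(np.allclose(S_u, net_win_direct(R_u, pairs, m)))
```

Because every comparison adds one win and one loss, the coordinates of a net-win vector always sum to zero, which is why these vectors live in an $(m-1)$-dimensional subspace.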


The effectiveness of net-win vectors in clustering users relies on the following surprising fact: the
expected comparison vectors $E[R_u]$ for all users, which are $\binom{m}{2}$-dimensional nonlinear functions of the
score vectors $\theta_u$, are close to the $(m-1)$-dimensional linear subspace spanned by the rows of $A$, or the row
space of $A$. This suggests denoising the $R_u$'s for all users by projecting them onto the row space of $A$. The
projections of the $R_u$'s turn out to be isometric to the net-win vectors $S_u$. In particular, recall our definition
of $S_u$ given in (1): the term $\frac{1}{\sqrt{m}} A^T$ acts just like an orthogonal projection onto the row space of $A$. We show
in Section IV that for any two users $v, w$ in two different clusters, $\|E[S_v] - E[S_w]\|_2 \approx \|E[R_v] - E[R_w]\|_2$
and $\|S_u - E[S_u]\|_2 \approx \frac{1}{\sqrt{m}} \|R_u - E[R_u]\|_2$ for $u = v, w$. Therefore, the net-win vectors $S_u$ are much less
noisy and easier to separate than the comparison vectors $R_u$.

We then cluster the net-win vectors by a standard spectral clustering algorithm in Step 2 of Algorithm 1.
Let $\{C_k\}$ denote the true clusters and $\{\hat{C}_k\}$ denote the clusters generated by Algorithm 1 with threshold
$\tau$. Since the clusters are only identifiable up to a permutation of the indices, we define the number of
errors in $\{\hat{C}_k\}$ as

$$\min_{\pi} \sum_{k} |C_k \,\Delta\, \hat{C}_{\pi(k)}|,$$

where $\Delta$ denotes the symmetric difference of two sets. The following theorem highlights a key contribution
of the paper: the projection of comparison vectors to the row space of A results in significant denoising, 
which allows for accurately clustering the users with only a small number of pairwise comparisons. 

Theorem 1. Let $\tau = [\ldots]$. If $b \in [b_0, 5]$ for any arbitrarily small constant $b_0 > 0$ or $b > C'' m^3 \log m$
for some constant $C''$, then with high probability, there exists a permutation $\pi$ such that

$$|C_k \,\Delta\, \hat{C}_{\pi(k)}| \le \frac{512\, r \max\{m, n\} \log m \log n}{(1-\epsilon) m^2}$$

and

$$\sum_{k} |C_k \,\Delta\, \hat{C}_{\pi(k)}| \le \frac{1024\, r \max\{m, n\} \log m \log n}{(1-\epsilon) m^2}.$$

In particular, when

$$(1-\epsilon) K m^2 > 512\, r \max\{m, n\} \log m \log^2 n,$$

the fraction of misclustered users in each cluster $k$ satisfies

$$\frac{|C_k \,\Delta\, \hat{C}_{\pi(k)}|}{K} \le \frac{1}{\log n}.$$

Theorem 1 implies that if $m$ and $n$ are on the same order, roughly each user only needs to give
$r^2\, \mathrm{poly}(\log m)$ pairwise comparisons to allow for the correct clustering for all but $K/\log n$ users. Notice
that a user needs to give at least one pairwise comparison. The specific choice of $\tau$ is just for simplicity
of the proof, and can be relaxed to $\tau = C\,[\ldots]$ for any constant $C > 0$. The lower bound $b \ge b_0$
is required. Consider the extreme case where $b = 0$; then the score vectors for all users are all-zero
vectors and the clusters are unidentifiable from pairwise comparisons. The upper bound $b \le 5$ is an artifact
of our analysis as shown by our numerical experiments. Note that if $b = 5$, then the most favorable item
is preferred over the least favorable item with probability approximately 0.993.

After estimating the clusters, Algorithm 1 treats each cluster separately and estimates a score vector
using the maximum likelihood estimation for the single cluster Bradley-Terry model. In order to avoid the
dependence between Step 2 and Step 3, we generate two smaller samples $R^{(1)}$ and $R^{(2)}$ by subsampling
$R$, and use them in the two steps respectively. It is not hard to verify that the support sets $\Omega_1$ and $\Omega_2$ are
independent.
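
The independence of the two supports follows from a short calculation: an entry is observed with probability $1 - \epsilon$, lands in $\Omega_1$ with probability $(1-\epsilon)\left[\frac{1+\epsilon}{4} + \frac{1-\epsilon}{4}\right] = \frac{1-\epsilon}{2}$, and lands in both sets with probability $\frac{(1-\epsilon)^2}{4}$, which is the product of the two marginals. The sketch below checks this empirically; `split_samples` is a hypothetical helper name and the matrix sizes are arbitrary.

```python
import numpy as np

def split_samples(R, eps, rng):
    """Sample splitting as in Step 0: each observed entry goes only to the first
    sample w.p. (1+eps)/4, only to the second w.p. (1+eps)/4, to both w.p.
    (1-eps)/4, and to neither otherwise."""
    u = rng.random(R.shape)
    p_only = (1.0 + eps) / 4.0
    p_both = (1.0 - eps) / 4.0
    only1 = u < p_only
    only2 = (u >= p_only) & (u < 2 * p_only)
    both = (u >= 2 * p_only) & (u < 2 * p_only + p_both)
    observed = R != 0
    in1 = observed & (only1 | both)
    in2 = observed & (only2 | both)
    return R * in1, R * in2

rng = np.random.default_rng(0)
eps = 0.4
shape = (200, 2000)
# Synthetic +/-1 outcomes with erasure probability eps (outcome values are irrelevant here).
R = rng.choice([-1, 1], size=shape) * (rng.random(shape) >= eps)
R1, R2 = split_samples(R, eps, rng)

p1 = (R1 != 0).mean()
p2 = (R2 != 0).mean()
p12 = ((R1 != 0) & (R2 != 0)).mean()
print(p1, p2, p12, p1 * p2)   # p12 should be close to p1 * p2
```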

The overall performance of Algorithm 1 is characterized by the following theorem, which shows that,
when the number of pairwise comparisons is large enough, the estimations of the score vectors are accurate 
for most users with high probability. 


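Theorem 2. Define

$$\eta_1 = \frac{r \max\{m, n\} \log m \log n}{(1-\epsilon) K m^2}, \qquad \eta_2 = \frac{\log m}{(1-\epsilon) K m}.$$

Assume $b \in [b_0, 5]$ for any arbitrarily small constant $b_0 > 0$; then there exists a constant $C > 0$ such that
with high probability

$$\frac{\|\hat{\theta}_u - \theta_u\|_2}{\|\theta_u\|_2} \le \frac{C (e^b + 1)}{b e^b} \max\{\eta_1, \eta_2\}$$

except for $512 K \eta_1$ users. In particular, if $K m^2 (1-\epsilon) > r \max\{m, n\} \log m \log^2 n$, then
$\frac{\|\hat{\theta}_u - \theta_u\|_2}{\|\theta_u\|_2} = O\!\left(\frac{1}{\log n}\right)$ except for $O\!\left(\frac{K}{\log n}\right)$ users.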

Theorem 2 shows that the estimation error depends on the maximum of $\eta_1$ and $\eta_2$: $\eta_1$ characterizes the
fraction of misclustered users in a given cluster as shown by Theorem 1, while $\eta_2$ characterizes the estimation
error of the maximum likelihood estimation assuming the clustering is perfect. If $r = 1$, then there is
no clustering error and the estimation error only depends on $\eta_2$, which matches the existing results in
[22] with a single type of user. The lower bounds in [22] and [13] show that at least $\Omega(m)$ pairwise
comparisons per type are needed to ensure that the estimation error is $o(1)$ even when clusters are known. Also, a user
needs to provide at least one pairwise comparison for us to infer his/her preference, which means that at
least $\Omega(n)$ pairwise comparisons in total are required to infer the preferences for most users. Theorem 2
shows that Algorithm 1 needs approximately $\frac{1}{2}(1-\epsilon) K m^2 = O(r \max\{m, n\} \log m \log^2 n)$ comparisons
per cluster, which matches the lower bounds up to logarithmic factors if $r$ is poly-logarithmic in $n$ or $m$.
IV. Denoising using Net-win Vectors

In this section, we analyze Step 1 of Algorithm 1. We first argue that directly clustering based on the
comparison vector is too noisy to work well. Then, we show the net-win vectors preserve the distances 
between different clusters from a geometric projection point of view. Finally, we prove the net-win vectors 
are much less noisy than the comparison vectors. 

Recall that $R_u$ is a $\binom{m}{2}$-dimensional vector of all pairwise comparisons for user $u$. For any $i < j$, the
mean of the $ij$-th entry is

$$E[R_{u,ij}] = (1-\epsilon)\, \frac{e^{\theta_{u,i}} - e^{\theta_{u,j}}}{e^{\theta_{u,i}} + e^{\theta_{u,j}}} = (1-\epsilon)\, f(\theta_{u,i} - \theta_{u,j}),$$

where $f(x) = \frac{e^x - 1}{e^x + 1}$. Since two users from the same cluster have the identical score vector, the means of
their comparison vectors are also identical. With a slight abuse of notation, let $R_k$ denote the common
mean of the comparison vectors for users in cluster $k$, where $k = 1, \ldots, r$. For $k \ne k'$, we call $\|R_k - R_{k'}\|_2$
the distance between clusters $k$ and $k'$. It is easy to check that $\|R_k - R_{k'}\|_2 = \Theta((1-\epsilon) m)$ with high
probability. In other words, the distances between different clusters are roughly $\Theta((1-\epsilon) m)$. Hence, if
we observe the means of all the comparison vectors, then clustering becomes trivial. In our problem, for
each user $u$, we only observe $R_u$, which is a noisy observation of $E[R_u]$. More specifically, since the
expected number of comparisons a user provides is $(1-\epsilon)\binom{m}{2}$,

$$E\left[\|R_u - E[R_u]\|_2^2\right] = \sum_{i<j} \mathrm{Var}[R_{u,ij}] = \Theta((1-\epsilon) m^2).$$


Therefore, we would expect the deviation $\|R_u - E[R_u]\|_2 = \Theta(m\sqrt{1-\epsilon})$, which is much larger than
the distances between different clusters given by $\Theta((1-\epsilon) m)$. As a result, the comparison vectors for
two users from the same cluster are likely to be far apart, while the comparison vectors for two users 
from different clusters might happen to be close. Therefore, the comparison vectors are too noisy to be 
clustered directly. 

In the following, we explain how to denoise the comparison vectors. An interesting observation is
that the mean of the comparison vector $E[R_u]$ lies close to an $(m-1)$-dimensional linear subspace. In
particular, using the definition of $A$, we get

$$E[R_u] = (1-\epsilon)\, f(\theta_u A),$$

where for a vector $v \in \mathbb{R}^m$, $f(v) = (f(v_1), \ldots, f(v_m))$. Although $f$ is a non-linear function, we are able
to show the angle $\alpha$ between $E[R_u]$ and the $(m-1)$-dimensional linear subspace spanned by the rows of
$A$, or the row space of $A$, is not large. To see this, let us first assume $b$ is small. Recall that $|\theta_{k,i} - \theta_{k,j}| \le b$
for any $k$, $i$ and $j$. In this regime, we can linearize the function $f$ at 0 and get

$$f(\theta_u A) \approx \frac{1}{2}\, \theta_u A,$$

which means that $E[R_u]$ is approximately on the row space of $A$ and the angle $\alpha \approx 0$. Somewhat
surprisingly, $\alpha$ is still not too large even if $b$ becomes so large that the linear approximation does not
work any more. Consider the extreme case when $b \to \infty$; under our assumption that the $\theta_{u,i}$ are uniformly
distributed, we have $|\theta_{u,i} - \theta_{u,j}| \to \infty$, thus

$$f(\theta_u A) \to \mathrm{sign}(\theta_u A).$$

The following lemma shows that $\alpha$ is approximately 35° in this case.

Lemma 1. For any $\theta_k \in \mathbb{R}^m$, assume $\theta_{k,i} \ne \theta_{k,j}$ for any $i \ne j$. Define the row vector $\eta \in \{-1, +1\}^{\binom{m}{2}}$
as $\eta_{ij} = \mathrm{sign}(\theta_{k,i} - \theta_{k,j})$. Then the angle between $\eta$ and the row space of $A$ is $\arccos\sqrt{2/3}$ in the limit as
$m \to \infty$.
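
The limiting angle can be checked numerically. With $A$ defined by columns $e_i - e_j$, one has $A A^T = m I - \mathbf{1}\mathbf{1}^T$, so the orthogonal projector onto the row space of $A$ is $A^T A / m$; a short calculation (our own, offered as a sanity check rather than as the paper's proof) then gives $\cos^2\alpha = \frac{2(m+1)}{3m}$ for any score vector with distinct entries, which tends to $2/3$. The function name below is hypothetical.

```python
import numpy as np

def cos_angle_to_row_space(theta):
    """Cosine of the angle between sign(theta A) and the row space of A."""
    m = theta.size
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    A = np.zeros((m, len(pairs)))
    for col, (i, j) in enumerate(pairs):
        A[i, col], A[j, col] = 1.0, -1.0
    eta = np.sign(theta @ A)            # entries sign(theta_i - theta_j), i < j
    # Projector onto the row space of A is A^T A / m because A A^T = m I - 1 1^T
    # and every column of A sums to zero.
    proj = (eta @ A.T) @ A / m
    return np.linalg.norm(proj) / np.linalg.norm(eta)

rng = np.random.default_rng(2)
m = 100
cos_alpha = cos_angle_to_row_space(rng.normal(size=m))   # distinct scores almost surely
print(cos_alpha ** 2, 2 * (m + 1) / (3 * m))
print(np.degrees(np.arccos(cos_alpha)))
```

The value depends only on $m$, not on the particular ordering of the scores, since the net-win counts of a full ranking are always a permutation of $m-1, m-3, \ldots, -(m-1)$.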





For the intermediate range of $b$, we do not have an analytical result on the upper bound of the angle
$\alpha$. Through extensive simulation, as plotted in Figure 1, we can see that $\cos\alpha$ averaged over 100
independent simulations decreases monotonically with $b$, and $\alpha$ is always upper bounded by $\arccos\sqrt{2/3}$.

Fig. 1: Cosine of the angle between $f(\theta_u A)$ and the row space of $A$ for various $b$ ($m = 100$).


The observation that $E[R_u]$ is close to the row space of $A$ suggests that we may denoise the comparison
vectors by projecting $R_u$ onto the row space of $A$. We show in Lemma 6 that the SVD of $A$ is given
by $A = \sqrt{m}\, U V^T$, where $U \in \mathbb{R}^{m \times (m-1)}$ and $V \in \mathbb{R}^{\binom{m}{2} \times (m-1)}$. Since the row vectors of $V^T$ form an
orthonormal basis of the row space of $A$, the projection of $R_u$ onto the row space of $A$ is given by
$R_u V V^T$ and, when represented in the basis $V^T$, the projection is simply $R_u V$. Interestingly, we find that
$R_u V V^T$ is isometric to the normalized net-win vectors $S_u$ used in Algorithm 1:²

$$S_u = \frac{R_u A^T}{\sqrt{m}} = R_u V U^T.$$

Since the rows of $U^T$ form an orthonormal basis, when represented in the basis $U^T$, $S_u$ is simply $R_u V$,
which is exactly the same as $R_u V V^T$ when represented in the basis $V^T$. Hence, the net-win vectors are
equivalent to the projection of comparison vectors onto the row space of $A$. The benefit of using the
net-win vectors instead of doing the projection is that they have a clearer physical meaning and are
easier to compute; there is no need to compute the SVD of $A$, which is prohibitive when $m$ is large.
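
For small $m$ the stated structure of $A$ is easy to verify numerically. The following sketch, with an arbitrary small $m$ and a random comparison vector (both assumptions for illustration), checks that $A$ has $m-1$ singular values equal to $\sqrt{m}$ and one equal to zero, and that the explicit projection $R_u V V^T$ has the same norm as the net-win vector $S_u$.

```python
import numpy as np

m = 8
pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
A = np.zeros((m, len(pairs)))
for col, (i, j) in enumerate(pairs):
    A[i, col], A[j, col] = 1.0, -1.0

# A A^T = m I - 1 1^T, so the non-zero singular values of A all equal sqrt(m).
gram = A @ A.T
U, sing, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[: m - 1].T                      # orthonormal basis of the row space of A

rng = np.random.default_rng(4)
R_u = rng.choice([-1, 0, 1], size=len(pairs))
projection = R_u @ V @ V.T             # explicit projection onto the row space
S_u = R_u @ A.T / np.sqrt(m)           # net-win vector, no SVD required

print(sing)
print(np.linalg.norm(projection), np.linalg.norm(S_u))
```

The last line prints two equal norms, illustrating the isometry: the net-win vector carries the same information as the projection while costing only a sparse matrix-vector product.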

Since $E[S_u] = \frac{1}{\sqrt{m}} E[R_u] A^T$, two users from the same cluster have the same expected net-win vector.
With a slight abuse of notation, let $\bar{S}_k$ denote the common expected net-win vector for users in cluster $k$,
where $k = 1, \ldots, r$. The following lemma confirms that after the projection, $\|\bar{S}_k - \bar{S}_{k'}\|_2 \asymp \|R_k - R_{k'}\|_2$
and thus the projection preserves the distances between different clusters.

Lemma 2. Assume $m \ge C' \log r$ for some constant $C'$. If $b \in [b_0, 5]$ for some constant $0 < b_0 < 5$ or
$b > C'' m^3 \log m$, then there exists some constant $C$ such that with high probability, for any $k \ne k'$,

$$\|\bar{S}_k - \bar{S}_{k'}\|_2 \ge C (1-\epsilon) m.$$


Remark 1. The lower bound $b \ge b_0$ is necessary. When $b$ is too small, the $\theta_k$'s all become very close to the
all-zero vector and the distance between different clusters given by $\|R_k - R_{k'}\|_2$ is too small to distinguish
different clusters. Even though our theorem requires $b \in [b_0, 5]$ or $b$ to be very large, our experiment shows
that the $\bar{S}_k$'s are in fact well separated for any $b \ge b_0$. Moreover, the proof indicates that Lemma 2
applies to general pairwise comparison models as long as the probability that item $i$ is preferred over
item $j$ minus the probability that item $j$ is preferred over item $i$ can be parameterized as $f(\theta_i - \theta_j)$ for
some sigmoid function $f$ (in the BT model, $f(x) = \frac{e^x - 1}{e^x + 1}$); the upper bound $b \le 5$ changes to $b \le c$, where
$c = \min\{b : |f(b) - f'(0) b| / b > f'(0)/\sqrt{5}\}$.

² In Algorithm 1 we generate two independent samples from $R$, and $S_u$ is defined using $R^{(1)}_u$. Here, we simply write $R^{(1)}_u$ as $R_u$ for ease
of notation.

Next, we show the net-win vectors are much less noisy than the comparison vectors. In particular,
let $\bar{S}_u = E[S_u]$; then $\|S_u - \bar{S}_u\|_2 \le 3\sqrt{(1-\epsilon)\, m \log n}$, which is much smaller than the deviation
$\|R_u - E[R_u]\|_2 = \Theta(m\sqrt{1-\epsilon})$.

Lemma 3. If $(1-\epsilon) m^2 \ge 36 \log n$, then with high probability,

$$\|S_u - \bar{S}_u\|_2 \le 3\sqrt{(1-\epsilon)\, m \log n}, \quad \forall u.$$

Notice that Lemma 3 is independent of the pairwise comparison model. Together with Lemma 2, it
shows that the projection of comparison vectors onto the row space of $A$ preserves the distances between
different clusters and at the same time dramatically reduces the noise variances. In particular, if $m(1-\epsilon) =
\Omega(\log n)$, then the net-win vectors corresponding to different clusters are well-separated; K-means or some
thresholding-based algorithm is going to work. In the next section, we will show that spectral clustering
based on the net-win vectors does even better and works if $m^2(1-\epsilon) = \Omega(r^2\, \mathrm{poly}(\log n))$ when $m$ and $n$
are on the same order.

Finally, we point out that the idea of projection or equivalently the net-win vectors introduced in this 
subsection, is not specific to the BT model and is applicable to general pairwise comparison models. 

V. User clustering and score vector estimation 

In this section, we analyze Steps 2 and 3 of Algorithm 1. Step 2 clusters the net-win vectors $S_u$
by a variation of the standard spectral clustering algorithm. After clustering the users, the algorithm
estimates the score vectors for each cluster using sample $R^{(2)}$. Recall that the supports of $R^{(1)}$ and $R^{(2)}$
are independent, which is important for the analysis to decouple the two steps.


A. User clustering 

Step 2 of Algorithm 1 first computes the best rank $r$ approximation $\hat{S}$ of $S$, and then clusters the
rows of $\hat{S}$ by a simple threshold based clustering algorithm. The reason we consider this threshold based
clustering algorithm is that it is easy to analyze. However, in the experiments we see later, the more robust
K-means algorithm is used instead.
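
A minimal sketch of this clustering step is given below. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the threshold `tau` is passed in by the caller, each new cluster is restricted to users that are still unclustered, and the demo uses synthetic Gaussian rows rather than real net-win vectors.

```python
import numpy as np

def rank_r_approx(S, r):
    """Best rank-r approximation of S via a truncated SVD."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def threshold_cluster(S_hat, r, tau, rng):
    """Sequentially pick a random unclustered user and group every unclustered
    user whose row is within distance tau of it."""
    n = S_hat.shape[0]
    labels = np.full(n, -1)
    for k in range(r):
        free = np.flatnonzero(labels == -1)
        if free.size == 0:
            break
        u = rng.choice(free)
        dist = np.linalg.norm(S_hat - S_hat[u], axis=1)
        labels[(dist <= tau) & (labels == -1)] = k
    labels[labels == -1] = 0   # leftover users are assigned arbitrarily
    return labels

# Synthetic demo: 3 well-separated row means plus small noise (illustrative values).
rng = np.random.default_rng(5)
r, K, m = 3, 50, 20
centers = rng.normal(scale=5.0, size=(r, m))
truth = np.repeat(np.arange(r), K)
S = centers[truth] + 0.3 * rng.normal(size=(r * K, m))

labels = threshold_cluster(rank_r_approx(S, r), r, tau=10.0, rng=rng)
same_est = labels[:, None] == labels[None, :]
same_true = truth[:, None] == truth[None, :]
print(np.array_equal(same_est, same_true))
```

Comparing the co-membership matrices sidesteps the fact that cluster labels are only identifiable up to a permutation.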

The use of $\hat{S}$ can be understood from a geometric projection point of view. Let $\bar{S} = E[S]$. Since the
users from the same cluster have the same expected net-win vector, the rank of $\bar{S}$ is $r$. In other words,
the expected net-win vectors $E[S_u]$ lie in an $r$-dimensional subspace of $\mathbb{R}^m$. Therefore, similar to the
projection idea introduced in Section IV, we may de-noise the net-win vectors $S_u$ by projecting them
onto this $r$-dimensional subspace. However, this $r$-dimensional subspace is determined by $\bar{S}$, which is
unobservable. Here the key idea is that $S$ is a perturbation of $\bar{S}$ and thus the space spanned by the top
$r$ right singular vectors of $S$ is close to the desired $r$-dimensional subspace. Hence, we can de-noise the
net-win vectors $S_u$ by projecting them onto the space spanned by the top $r$ right singular vectors of $S$;
the projections are exactly the rows $\hat{S}_u$. The following lemma shows that such a projection is effective in de-noising. In
particular, it shows $\|\hat{S} - \bar{S}\|_F^2 = O((1-\epsilon)\, r \max\{m, n\} \log^3 n)$, which is much smaller than the deviation
bound $\|S - \bar{S}\|_F^2 = O((1-\epsilon)\, m n \log n)$ as shown by Lemma 3.

Lemma 4. If $(1-\epsilon) m^2 \ge 36 \log n$, then with high probability,

$$\|S - \bar{S}\| \le 8\sqrt{(1-\epsilon) \max\{m, n\}}\, \log^{3/2} n,$$
$$\|\hat{S} - \bar{S}\|_F \le 16\sqrt{2(1-\epsilon)\, r \max\{m, n\}}\, \log^{3/2} n.$$


Using a counting argument together with Lemma 4, we can show that, for most users $u$, $\hat S_u$ is close to its
expected net-win vector $\bar S_u$.

Lemma 5. Let $\tau = $ [expression illegible in the source]. Then, with high probability, there are at most

$$\frac{512\, r \max\{m,n\} \log m \log n}{(1-\epsilon)m^2}$$

users such that $\|\hat S_u - \bar S_u\|_2 > \tau$.

Combined with the fact that the $\bar S_k$'s are well separated, as shown in Lemma 2, we get Theorem 1.


B. Score vector estimation 

In Step 3, Algorithm 1 estimates the score vectors for each cluster separately. When there is no clustering
error, the problem reduces to the inference problem for the classical Bradley-Terry model. In particular,
if we let $W_{ij}$ be the number of times item $i$ is preferred over item $j$, then the ranking problem can be
solved by the maximum likelihood estimation

$$\hat\theta = \arg\max_{\gamma} \sum_{i,j} W_{ij} \log \frac{e^{\gamma_i}}{e^{\gamma_i} + e^{\gamma_j}}.$$

The above optimization is convex and can be solved efficiently [15]. Further, the recent work [22] provides
an error bound for $\hat\theta$ when the pairs of items are chosen uniformly and independently.
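
To make the estimation step concrete, the sketch below computes the Bradley-Terry maximum likelihood estimate from a win-count matrix using the standard minorization-maximization (MM) update. This is one common way to solve the problem and is not necessarily the method of [15]; the function name `bt_mle`, the fixed iteration count, and the zero-mean normalization are assumptions of this example. It also assumes that the directed "beats" graph is strongly connected, so that a finite maximizer exists.

```python
import math


def bt_mle(wins, iters=500):
    """Bradley-Terry MLE via MM iterations (illustrative sketch).

    wins[i][j] = number of times item i was preferred over item j
    (the diagonal is ignored). Returns scores theta with sum(theta) = 0.
    """
    m = len(wins)
    w = [1.0] * m  # w_i = exp(theta_i)
    total_wins = [sum(wins[i][j] for j in range(m) if j != i) for i in range(m)]
    for _ in range(iters):
        new_w = []
        for i in range(m):
            # n_ij / (w_i + w_j), summed over opponents j, where
            # n_ij = wins[i][j] + wins[j][i] is the number of comparisons.
            denom = sum((wins[i][j] + wins[j][i]) / (w[i] + w[j])
                        for j in range(m) if j != i)
            new_w.append(total_wins[i] / denom)
        # Scores are only identified up to a shift; fix the geometric mean to 1.
        log_mean = sum(math.log(x) for x in new_w) / m
        w = [x / math.exp(log_mean) for x in new_w]
    return [math.log(x) for x in w]
```

For two items with 3 wins against 1, the estimate satisfies $\hat\theta_1 - \hat\theta_2 = \log 3$, matching the empirical win ratio.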

In general, the clustering step is not perfect, but if there are sufficiently many pairwise comparisons,
Theorem 1 shows that the clusters can be approximately recovered with high probability. In this case,
Algorithm 1 simply treats the users in each estimated cluster as if they came from the same true cluster, and again solves the
optimization problem corresponding to the maximum likelihood estimation for the Bradley-Terry model
for each cluster.

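Take one such cluster $\hat{\mathcal C}$ as an example. Recall that $|\mathcal C \triangle \hat{\mathcal C}|$ denotes the size of the set difference between the true
cluster $\mathcal C$ and the estimated cluster $\hat{\mathcal C}$. It follows that at most $|\mathcal C \triangle \hat{\mathcal C}|$ users in $\hat{\mathcal C}$ are from other clusters and
at most $|\mathcal C \triangle \hat{\mathcal C}|$ users in $\mathcal C$ are assigned to wrong clusters. To simplify the notation, we omit the subscript
and use $\theta$ to denote the true score vector for the cluster $\mathcal C$ throughout this section. Let $\hat\theta$ be the estimated
score vector for cluster $\hat{\mathcal C}$. The following theorem shows that when the number of comparisons is large
enough, the relative error $\frac{\|\hat\theta - \theta\|_2}{\|\theta\|_2}$ goes to zero when $|\mathcal C \triangle \hat{\mathcal C}|/K \to 0$. We should emphasize that $\hat\theta$ is only
a good approximation for the score vectors of the users from cluster $\mathcal C$.

Theorem 3. Let $\hat{\mathcal C}$ denote an estimator of a fixed cluster $\mathcal C$. Then there exists some constant $C$ such that
with high probability

$$\frac{\|\hat\theta - \theta\|_2}{\|\theta\|_2} \le C\,\frac{(e^b+1)^2}{b\,e^b}\,\max\left\{\sqrt{\frac{\log m}{(1-\epsilon)Km}},\ \frac{|\mathcal C \triangle \hat{\mathcal C}|}{K}\right\}.$$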


Theorem 3 extends the previous results in [22] to the setting with clustering errors. Notice that the
error bound scales exponentially with $b$. This is likely to be an artifact of our analysis and also appears
in previous results in [22].

VI. Proofs

In this section, we present the proofs for the main theorems first and then the lemmas. The proof of
Theorem 1 uses Lemma 2 and Lemma 5. We prove Theorem 2 by combining Theorem 1 and Theorem 3.
We first introduce some additional notation used in the proofs. Let $I$ denote the identity matrix. Let $\mathbf{1}$
denote the all-ones vector and $\mathbf{1}\mathbf{1}^T$ denote the all-ones matrix. For an $m \times n$ matrix
$X$ and an $m \times 1$ vector $v$, let $[X, v]$ denote the $m \times (n+1)$ matrix formed by adding $v$ as a column to
the end of $X$. For two $n \times n$ matrices $X, Y$, we write $X \preceq Y$ if $Y - X$ is positive semi-definite.


A. Proof of Theorem 1

Recall the threshold $\tau$ from Lemma 5. We say a user $u$ is a good user if $\|\hat S_u - \bar S_u\|_2 \le \tau$. Under the assumption of
Theorem 1, the condition of Lemma 2 holds. Then, in view of Lemma 2, for all good users $u$,

||S , u -S tt || 2 < T ~ < i||5 fc -5 fc ,|| 2 . 

Let $\mathcal I$ be the set of good users. Lemma 5 shows that the number of bad users satisfies

$$|\mathcal I^c| \le \frac{512\, r \max\{m,n\} \log m \log n}{(1-\epsilon)m^2}.$$

Following the proof of Proposition 1 in [28], we can conclude that there exists a permutation $\pi$ such that,


I C k 


A C n{k) \ < 


\l c \ for all k and \C k 


A C n{k) \ <2|X C 


B. Proof of Theorem 2

From Theorem 1 we get that there exists a permutation $\pi$ such that

$$|\mathcal C_k \triangle \hat{\mathcal C}_{\pi(k)}| \le \frac{512\, r \max\{m,n\} \log m \log n}{(1-\epsilon)m^2}, \quad \forall k.$$


We then apply Theorem 3 and get the result of Theorem 2. If we want to achieve $\frac{\|\hat\theta_u - \theta_u\|_2}{\|\theta_u\|_2} = O\left(\frac{1}{\log n}\right)$
for the good users, we need

$$\frac{r \max\{m,n\} \log m \log n}{(1-\epsilon)Km^2} \le \frac{1}{\log n}, \qquad \sqrt{\frac{\log m}{(1-\epsilon)Km}} \le \frac{1}{\log n},$$

which requires $Km^2(1-\epsilon) > r \max\{m,n\} \log m \log^2 n$ and $Kmn(1-\epsilon) > rn \log m \log^2 n$, respectively.
Notice that the former condition is more stringent than the latter one; thus the clustering step needs more
pairwise comparisons than the score vector estimation step to achieve the same error rate.


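C. Proof of Theorem 3

Let $p(m,n) \triangleq \frac{|\mathcal C \triangle \hat{\mathcal C}|}{K}$ denote the normalized set difference between the true cluster and the estimated cluster. Recall
that $D_{u,ij} = \mathbb{I}_{\{R^{(2)}_{u,ij} = 1\}}$ and $D_{u,ji} = \mathbb{I}_{\{R^{(2)}_{u,ij} = -1\}}$ are the random variables indicating $u$'s comparison result
for $i < j$. The estimated score vector is given by $\hat\theta = \arg\max_\gamma L(\gamma)$, where

$$L(\gamma) = \sum_{u \in \hat{\mathcal C}} \sum_{i \ne j} D_{u,ij} \log \frac{e^{\gamma_i}}{e^{\gamma_i} + e^{\gamma_j}}.$$

Let $B_{u,ij} = B_{u,ji} = \mathbb{I}_{\{R^{(2)}_{u,ij} \ne 0\}}$ be the random variables indicating if $u$ compared $i$ and $j$. By definition,

$$\frac{\partial L}{\partial \gamma_i} = \sum_{j,\, u \in \hat{\mathcal C}} \left( D_{u,ij} - B_{u,ij} \frac{e^{\gamma_i}}{e^{\gamma_i} + e^{\gamma_j}} \right),$$

$$\frac{\partial^2 L}{\partial \gamma_i^2} = -\sum_{j} B_{ij} \frac{e^{\gamma_i} e^{\gamma_j}}{(e^{\gamma_i} + e^{\gamma_j})^2},$$

$$\frac{\partial^2 L}{\partial \gamma_i \partial \gamma_j} = B_{ij} \frac{e^{\gamma_i} e^{\gamma_j}}{(e^{\gamma_i} + e^{\gamma_j})^2},$$

where $B_{ij} = \sum_{u \in \hat{\mathcal C}} B_{u,ij}$. Let $\Delta = \hat\theta - \theta$. As $\hat\theta$ is the optimal solution,

$$0 \le L(\hat\theta) - L(\theta) = \langle \nabla L(\theta), \Delta \rangle + \frac{1}{2} \Delta^T \left( \nabla^2 L(\gamma) \right) \Delta,$$

where the second step is by Taylor expansion and $\gamma = \theta + \lambda \Delta$ for some $\lambda \in [0,1]$. Define $L_B =
\mathrm{diag}(B\mathbf{1}) - B$, where $\mathrm{diag}(v)$ denotes the diagonal matrix formed by the vector $v$; $L_B$ is known as the
Laplacian. By the Cauchy-Schwarz inequality,

$$\|\nabla L(\theta)\|_2 \|\Delta\|_2 \ge \frac{1}{2} \Delta^T \left( -\nabla^2 L(\gamma) \right) \Delta = \frac{1}{2} \sum_{i<j} (\Delta_i - \Delta_j)^2 B_{ij} \frac{e^{\gamma_i} e^{\gamma_j}}{(e^{\gamma_i} + e^{\gamma_j})^2} \ge \frac{e^b}{2(e^b+1)^2} \Delta^T L_B \Delta,$$

where the second inequality follows because $\frac{e^{\gamma_i} e^{\gamma_j}}{(e^{\gamma_i} + e^{\gamma_j})^2} \ge \frac{e^b}{(e^b+1)^2}$ since $|\gamma_i - \gamma_j| \le b$ for any $i, j$.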
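
The quadratic form $\Delta^T L_B \Delta$ used above equals the weighted sum of squared differences $\sum_{i<j} B_{ij} (\Delta_i - \Delta_j)^2$ for any symmetric $B$ with zero diagonal. The Python sketch below checks this identity on a small example; the helper names and the example matrix are made up for illustration and do not come from the paper.

```python
def laplacian(B):
    """Return L_B = diag(B 1) - B for a symmetric matrix B with zero diagonal."""
    m = len(B)
    return [[(sum(B[i]) if i == j else 0) - B[i][j] for j in range(m)]
            for i in range(m)]


def quad_form(M, x):
    """Compute x^T M x for a square matrix M given as a list of rows."""
    n = len(x)
    return sum(x[i] * M[i][j] * x[j] for i in range(n) for j in range(n))


def pairwise_form(B, x):
    """Compute sum over i < j of B_ij * (x_i - x_j)^2."""
    n = len(x)
    return sum(B[i][j] * (x[i] - x[j]) ** 2
               for i in range(n) for j in range(i + 1, n))


# Hypothetical comparison counts between three items and a test vector.
B_example = [[0, 2, 1], [2, 0, 3], [1, 3, 0]]
x_example = [1, -2, 4]
```

Here `quad_form(laplacian(B_example), x_example)` and `pairwise_form(B_example, x_example)` agree, which is the step that turns the Hessian bound into a bound in terms of $L_B$.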

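Let $Z_{u,ij} = D_{u,ij} - B_{u,ij} \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}}$. First we bound $\|\nabla L(\theta)\|_2$. For each $i$,

$$\frac{\partial L}{\partial \theta_i} = \sum_{j,\, u \in \mathcal C} Z_{u,ij} + \sum_{j,\, u \in \hat{\mathcal C} \setminus \mathcal C} Z_{u,ij} - \sum_{j,\, u \in \mathcal C \setminus \hat{\mathcal C}} Z_{u,ij}.$$

The first term is independent of $\hat{\mathcal C}$. For $u \in \mathcal C$, $\mathbb{E}[Z_{u,ij}] = 0$ and $\mathrm{Var}[Z_{u,ij}] \le \frac{1-\epsilon}{4}$. By Bernstein's
inequality, with high probability for large $m$,

$$\left| \sum_{j,\, u \in \mathcal C} Z_{u,ij} \right| \le C_1 \sqrt{(1-\epsilon) K m \log m}.$$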


We bound the next two terms by 


- E ■ 

j,uec\c 


J u,lj 


E ■ 

j,uec\c 


u ,ij 


~ y '] Bum 

j,uei c 


Since the matrix B only depends on Q 2 but not the comparison results, the right hand side above is 
independent of /A |J or C. As B u are independent Bernoulli random variables with parameter 1 — e, with 
high probability for large m, Yhj U ei c B u ,ij A C 2 ( 1 — e)mKp(m,n). Thus, 


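$$\left| \sum_{j,\, u \in \hat{\mathcal C} \setminus \mathcal C} Z_{u,ij} - \sum_{j,\, u \in \mathcal C \setminus \hat{\mathcal C}} Z_{u,ij} \right| \le C_2 (1-\epsilon)\, m K\, p(m,n).$$

Therefore,

$$\|\nabla L(\theta)\|_2 \le c_2 (1-\epsilon) K m^{3/2} \max\left\{ \sqrt{\frac{\log m}{(1-\epsilon)Km}},\ p(m,n) \right\}.$$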

Next we bound A T L^A. Again by the fact that B is independent of if 11 , we can simply follow the 
proof of Theorem 4 in [ 221 and get 

A t £ b A> j(l-e)Jfm||A||l, 

with high probability for large $m$. Combining the above results, we get the upper bound on $\|\Delta\|_2$:

$$\|\Delta\|_2 \le c_3 \frac{(e^b+1)^2}{e^b} \sqrt{m}\, \max\left\{ \sqrt{\frac{\log m}{(1-\epsilon)Km}},\ p(m,n) \right\}.$$

On the other hand, similar to the proof of Lemma [ 7 J we can show that ||0|| 2 > . Therefore, 


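$$\frac{\|\hat\theta - \theta\|_2}{\|\theta\|_2} = \frac{\|\Delta\|_2}{\|\theta\|_2} \le C\, \frac{(e^b+1)^2}{b\, e^b} \max\left\{ \sqrt{\frac{\log m}{(1-\epsilon)Km}},\ p(m,n) \right\}.$$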


D. Proof of Lemma 1

We first present a lemma on the properties of $A$ (proved in the Appendix). Recall that $A \in \{\pm 1, 0\}^{m \times \binom{m}{2}}$
is the matrix with the $ij$-th column being $e_i - e_j$, where $e_i \in \{0,1\}^m$ is a vector with all 0s except for a
1 in the $i$-th coordinate.

Lemma 6. The matrix $A$ is of rank $m-1$ with SVD $A = \sqrt{m}\, U V^T$, where $U \in \mathbb{R}^{m \times (m-1)}$ and $V \in
\mathbb{R}^{\binom{m}{2} \times (m-1)}$. Moreover, the $\ell_2$-norms of the rows of $U$ and $V$ are $\sqrt{(m-1)/m}$ and $\sqrt{2/m}$, respectively.
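
As a sanity check on the structure of $A$, the sketch below builds $A$ for a small $m$ and forms $AA^T$, which should have $m-1$ on the diagonal and $-1$ elsewhere, that is, $mI - \mathbf{1}\mathbf{1}^T$. This is the identity behind the rank and singular-value claims of Lemma 6. The function names are illustrative and not taken from the paper.

```python
from itertools import combinations


def difference_matrix(m):
    """Build the m x C(m,2) matrix whose (i,j)-th column is e_i - e_j for i < j."""
    pairs = list(combinations(range(m), 2))
    A = [[0] * len(pairs) for _ in range(m)]
    for col, (i, j) in enumerate(pairs):
        A[i][col] = 1
        A[j][col] = -1
    return A


def gram(A):
    """Return A A^T for a matrix A given as a list of rows."""
    m = len(A)
    return [[sum(a * b for a, b in zip(A[i], A[k])) for k in range(m)]
            for i in range(m)]
```

Each row of $A$ has exactly $m-1$ nonzero entries, and two distinct rows share exactly one column, where they carry opposite signs; this is why the diagonal of `gram(difference_matrix(m))` is $m-1$ and every off-diagonal entry is $-1$.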

We are ready to prove Lemma 1. Since the rows of $V^T$ form an orthonormal basis of the row space of $A$, the projection
of a row vector $\eta$ onto this space is given by $\eta V V^T$. When represented in the basis $V$, the projection
is simply $\eta V$. Moreover, the cosine of the angle between $\eta$ and the row space of $A$ is $\frac{\|\eta V\|_2}{\|\eta\|_2}$.

Using the properties of $A$ proved in Lemma 6, we have

$$\|\eta V\|_2^2 = \eta V V^T \eta^T = \frac{1}{m} \eta A^T A \eta^T = \frac{1}{m} \|A \eta^T\|_2^2.$$

Note that

$$(A\eta^T)_i = \#\{j : \theta_{k,j} < \theta_{k,i}\} - \#\{j : \theta_{k,j} > \theta_{k,i}\}.$$

By the assumption that $\theta_{k,i} \ne \theta_{k,j}$ for any $i \ne j$, the vector $A\eta^T$ is always a permutation of the
deterministic vector

$$[-(m-1), -(m-3), \ldots, m-3, m-1]^T$$

representing the net wins of the items. Therefore,

$$\|\eta V\|_2^2 = \frac{1}{m} \left\| [-(m-1), -(m-3), \ldots, m-3, m-1] \right\|_2^2 = \frac{1}{3}(m^2 - 1).$$

Since $\|\eta\|_2^2 = \frac{1}{2} m(m-1)$, the angle between $\eta$ and the row space of $A$ is $\arccos\sqrt{2/3}$ in the limit as $m \to \infty$.
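
The two facts used in this proof are easy to check numerically. The sketch below, with made-up function names, computes the net wins for a vector of distinct scores and the squared cosine $\frac{(m^2-1)/3}{m(m-1)/2}$, which simplifies to $\frac{2(m+1)}{3m}$ and tends to $2/3$.

```python
def net_wins(theta):
    """For each item, count (# items with smaller score) - (# items with larger score)."""
    return [sum((t > s) - (t < s) for s in theta) for t in theta]


def cos_sq(m):
    """Squared cosine of the angle between eta and the row space of A for m items."""
    projected_sq = (m * m - 1) / 3   # ||eta V||_2^2
    total_sq = m * (m - 1) / 2       # ||eta||_2^2, one +-1 entry per pair
    return projected_sq / total_sq
```

For any seven distinct scores, `sorted(net_wins(theta))` is `[-6, -4, -2, 0, 2, 4, 6]`, whose squared norm divided by $m = 7$ equals $(7^2 - 1)/3 = 16$.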
E. Proof of Lemma 2

We prove the lemma by considering the two regimes of $b$ separately in the following two lemmas.

Lemma 7. Assume $m \ge C' \log r$. If $b \in [b_0, 5]$, then a.a.s. there exists some constant $C$ such that for any
$k \ne k'$,

$$\|\bar S_k - \bar S_{k'}\| \ge C(1-\epsilon)m.$$


Proof: Due to space constraints, we only sketch the proof in this subsection. A full proof is provided
in the Appendix.

By definition, = ] -kf f(9 u ,i — 9 UJ ). The function f(x) is nonlinear but it can be approximated by 
the linear function x/2 when x is close to 0. Since 0 kl — 9 k f < b for all i,j, the maximum approximation 
error is given by 5(b) = \ f(b) — ||. 

By definition, for any k, 

s t = l —if(eiA)vu J 

+ ff(MA) - \elA)vu\ 

Then, by triangle inequality, 




C~.—s/m\\(6 k - 0 V )U\\ 2 


+ \\(f(el,A) + -9lA)V\\ 2 


\\(f(0 T k A)--9 T k A)V\\ 2 


1 -e 

>- \/m 


2 II 6k — 9k' H 2 


5(b) 


k 2 


+ 11 e 


k' 21 


Using Hoeffding's inequality and Bernstein's inequality, we show that, with high probability,

$$\|\theta_k - \theta_{k'}\|_2 \ge \sqrt{0.9\, m b^2 / 6}, \qquad \|\theta_k\|_2 \le \sqrt{1.1\, m b^2 / 12}.$$


Therefore, 


||+ fc -^'|| 2 >- ~\ ~ 


If 1/0.9 _/l.l<5(6) 

2 \ 2 V ~6 V 1 V 


6(1 — e)m 


>C(1 — e)m, 


under our assumption on b. ■ 

Lemma 8. Assume m > C'logr. If b > Cm' log m, then a.a.s. there exists some constant C such that 
for any k kf 

\\S k -S k ,\\>C(l-e)m. 


Proof: Due to space constraint, we only sketch the proof in this subsection. A full proof is provided 
in the Appendix. 

The assumption that b is large implies that, with high probability, \0 k , — 9 k A > 1 for any k and i f j. 
By definition, ^ f(9 k A)VC. Define " 

9k,ij = ~^{e kti <e kij } 
to be the signed indicator variable of the order between $\theta_{k,i}$ and $\theta_{k,j}$, and $f(\theta_k A)$ is close to $\eta_k$ when $b$
is large. Then


\\S k -S k ' | 


2 

1 - e 


> 


\\(f(9 k A)-f(9 k ,A))V\\ 2 
II (Vk - Vk')V || 2 - || (f(O k A) - T] k )V || 2 


\\(f(9 k ,A)-ri kl )V\\, 


First, by a counting argument we show that $\|(f(\theta_k A) - \eta_k)V\|_2 \le C_1 \sqrt{m}$.

Next we show that $\|(\eta_k - \eta_{k'})V\|_2 \ge C_2 m$. Observe that

$$\|(\eta_k - \eta_{k'})V\|_2^2 = \|\eta_k V\|_2^2 + \|\eta_{k'} V\|_2^2 - 2\eta_k V V^T \eta_{k'}^T = \frac{2}{3}(m^2 - 1) - \frac{2}{m} \eta_k A^T A \eta_{k'}^T,$$

where the second equality follows from Lemma 1. By definition of $A$, $(\eta_k A^T)_i$ represents the number of
$\theta_j$ that are smaller than $\theta_i$ minus the number of $\theta_j$ that are larger than $\theta_i$. Therefore, $\eta_k A^T$ and $\eta_{k'} A^T$ are
independent random permutations of the deterministic vector $[-(m-1), -(m-3), \ldots, m-3, m-1]$.
Without loss of generality, assume $\eta_k A^T = [-(m-1), -(m-3), \ldots, m-3, m-1]$ and denote $\eta_{k'} A^T$
by $x$, which is a random permutation of $\eta_k A^T$. Let $Z = \eta_k A^T A \eta_{k'}^T = -(m-1)x_1 + \cdots + (m-1)x_m$
and define the martingale $y_i = \mathbb{E}[Z \mid x_1, \ldots, x_i]$. Using Azuma's inequality on $Z$, we can show that
$|\eta_k A^T A \eta_{k'}^T| \le C_3\, m^{5/2} \log^{1/2} m$ with high probability, thus $\|(\eta_k - \eta_{k'})V\|_2 \ge C_2 m$.

Combining the above two steps, we conclude that $\|\bar S_k - \bar S_{k'}\| \ge C(1-\epsilon)m$. ■


F. Proof of Lemma 3

Recall that for $i < j$, $V_{ij}$ denotes the $(i,j)$-th row of $V$. Rewrite $\|S_u - \bar S_u\|$ as


l|5„ - s„|| = II - Ri% )V v \\ & II *£ Z<l\\, 


i<j 


i<j 


where Z^ = {R^jj - RuJjWij- Note that ||Zjj|| 2 < ||Uy|| 2 = y/2/m. Since Var 


R 


(i) 

u,ij 


< E 


(!) \2 

,ij) 


( 


ff, we have 


E E DI Z «I0 


< 


i<j 


1 — e (m\ 2 1 — e 

2 / m ~ 2 


A 2 

m — a . 


Now we apply the vector Bernstein’s inequality [11, Theorem 12]. We choose t = -iaflog n and under 
our assumption it satisfies t^/2/m < a 2 . Then, for any u. 


P 


\\Su - ‘S'rtl | 2 > 4cr v / log ~n <P [II S u - S u \\ 2 > a + t] 


<exp(-^)<l/n 2 . 


Applying the union bound, we get the result. 
G. Proof of Lemma 4

We bound $\|S - \bar S\|$ by the matrix Bernstein's inequality [26]. Let $X_u = e_u (S_u - \bar S_u)$, then $S - \bar S =
\sum_u X_u$. First we bound $\|X_u\|$. Since

$$\|X_u\|^2 = \|X_u X_u^T\| = \|S_u - \bar S_u\|_2^2\, \|e_u e_u^T\| = \|S_u - \bar S_u\|_2^2,$$

by Lemma j^J | |X U | < 3^ mlogn with high probability. Next we bound 

a 2 ^ max{|| [X n X u T ] ||, || ]Te [XjX u ] ||}. 


The covariance matrix for S u is = V T D u V, where D u = diagQVar [R u ,ij]]ij) < 2 I- Then 


^E[X n X n T ]||=||^E[||^-^ 


u\ 12 e u e u \ 


= maxE [|| S u - 5„|| 2 ] 

U L 

= max Tr \V T D U V] . 

U 1 J 

Since D u < and using the fact that A < B implies Tr [V T (B — A)V] > 0, we get 11 E [X u Xj] 11 < 
^rn. Similarly, 

|| Y> [A„ T A„] II =11 [(S„ - S«) T (S« -SJ] II 


=11 u(£v^d u v)u t 


=II' /T (E D “)’ / I 


where the last inequality follows from D u < ffl and the fact that A < B implies ||K T ril^|| < ||y T i?L||. 
Therefore, cr 2 < ^Xf ma x{m, n}. Now by applying the matrix Bernstein’s inequality, we get 

115 — 511 <3 max {max | X, t || log n, cr\J\ogn) 

U 

<8y/(l — e) max{m, n} log 3 / 2 n. 

with probability at least 1 — 2 jn. 


H. Proof of Lemma 5

We prove Lemma 5 based on a counting argument. By Lemma 4, with probability at least $1 - 2/n$,

$$\|S - \bar S\| \le 8\sqrt{(1-\epsilon)\max\{m,n\}}\, \log^{3/2} n.$$

Note that

$$\|\hat S - \bar S\| \le \|\hat S - S\| + \|S - \bar S\| \le 2\|S - \bar S\|,$$

[Figure 2, panels $b = 4$ and $b = 10$]

Fig. 2: Performance comparison of the standard spectral clustering algorithm and Algorithm 1. The y-axis
is the expected number of comparisons $N$ provided by each user. The algorithms succeed in the parameter
regime above the corresponding curves. The two algorithms using net-win vectors show significantly better
clustering performance.

[Figure 3]

Fig. 3: Score vector estimation for different $r$. For each $r$, the blue curve shows how the relative error
changes with $\tilde r$, and it is minimized when $\tilde r = r$. From the red curve, $r$ can be identified by
looking for the $\tilde r$ such that the change $\|\hat\theta(\tilde r) - \hat\theta(\tilde r - 1)\|$ is minimized.


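where the second inequality follows from the definition of $\hat S$ and the fact that $\bar S$ has rank $r$. Since the
matrix $\hat S - \bar S$ is of rank at most $2r$, we get

$$\|\hat S - \bar S\|_F^2 \le \left( \sqrt{2r}\, \|\hat S - \bar S\| \right)^2 \le 8r \|S - \bar S\|^2 \le 512 (1-\epsilon)\, r \max\{m,n\} \log^3 n.$$

As $\|\hat S - \bar S\|_F^2 = \sum_u \|\hat S_u - \bar S_u\|_2^2$, we conclude that there are at most

$$\frac{512\, r \max\{m,n\} \log m \log n}{(1-\epsilon)m^2}$$

users with $\|\hat S_u - \bar S_u\|_2 > \tau$.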
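
The counting step is a Markov-type argument: if the squared row errors sum to at most $F^2$, then at most $F^2/\tau^2$ rows can have error larger than $\tau$. A minimal Python sketch of this step follows; the function names and the example numbers are hypothetical.

```python
def num_rows_exceeding(sq_errors, tau):
    """Count rows whose squared error is larger than tau^2."""
    return sum(1 for e in sq_errors if e > tau ** 2)


def counting_bound(sq_errors, tau):
    """Upper bound on that count: (sum of squared errors) / tau^2."""
    return sum(sq_errors) / tau ** 2


# Hypothetical squared row errors ||S_hat_u - S_bar_u||_2^2 for five users.
example_sq_errors = [0.1, 4.0, 9.5, 0.2, 16.0]
```

With `tau = 2`, two of the example rows exceed the threshold, while the bound evaluates to 29.8 / 4 = 7.45; the bound is loose here but never violated.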


VII. Experiments 

In this section, we illustrate the performance of our algorithm using synthetic data. Since the key idea
of the paper is that the net-win vectors $S_u$ are easier to cluster than the comparison vectors $R_u$,
in the first experiment we compare the clustering performance of Algorithm 1 with two other clustering
algorithms to verify the effectiveness of the dimension reduction. In the second experiment, we demonstrate
the performance of score vector estimation and suggest a heuristic for estimating the number of clusters.

A. Clustering performance comparison

In Algorithm 1, we cluster the rows of $\hat S$ using a threshold-based algorithm, which is for ease of
analysis. For the numerical experiments, we use the $K$-means clustering algorithm instead, which is more
robust. We initialize the centers for $K$-means clustering as follows. First, randomly pick a row as a center.
Then pick the row whose minimum distance from the existing centers is maximized and add it to the centers.
Continue this process until we have picked $r$ centers. In this experiment, we compare Algorithm 1 with
two other clustering algorithms:

1) the standard spectral clustering that applies the $K$-means algorithm to cluster the rows of $\hat R$, which
is the rank-$r$ approximation of $R$, and

2) the projected $K$-means that applies the $K$-means algorithm to cluster the net-win vectors $S = \frac{1}{\sqrt{m}} R A^T$.

We have also tried applying the $K$-means algorithm to cluster the rows of $R$ directly. However, as the $K$-means
algorithm never even approximately recovers the clusters in the parameter regime considered, we do not
include its performance in the plot.
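
The center initialization described at the start of this subsection (a random first center, then repeatedly the row farthest from all centers chosen so far) can be sketched in Python as follows. The function names and the optional `rng` argument are assumptions of this example, and the sketch covers only the initialization, not the subsequent $K$-means iterations.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(rows, r, rng=None):
    """Pick r initial centers: one at random, then greedily the farthest row."""
    rng = rng or random.Random(0)
    centers = [rows[rng.randrange(len(rows))]]
    while len(centers) < r:
        # For each row, take its distance to the nearest chosen center,
        # then select the row for which this distance is largest.
        farthest = max(rows, key=lambda row: min(sq_dist(row, c) for c in centers))
        centers.append(farthest)
    return centers
```

When the clusters are well separated, this rule tends to place one initial center in each cluster, which is presumably why it is a convenient starting point for $K$-means here.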

We evaluate the algorithms by measuring the fraction of misclustered users. Let $\{\mathcal C_k\}$ denote the true
clusters and $\{\hat{\mathcal C}_k\}$ denote the clusters generated by some clustering algorithm. For each $k$, we say $\hat{\mathcal C}_k$
corresponds to true cluster $k'$ if the majority of users in $\hat{\mathcal C}_k$ are from $\mathcal C_{k'}$, and we count any user in $\hat{\mathcal C}_k$ who
is from a different true cluster as an error. Then the fraction of misclustered users is defined as the total
number of errors divided by the total number of users.
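
A direct Python sketch of this evaluation metric is given below. The function name and the label-list input format are assumptions of this example.

```python
from collections import Counter


def misclustered_fraction(true_labels, est_labels):
    """Fraction of users that disagree with the majority true label of their estimated cluster."""
    errors = 0
    for k in set(est_labels):
        # True labels of the users placed in estimated cluster k.
        members = [t for t, e in zip(true_labels, est_labels) if e == k]
        majority_count = Counter(members).most_common(1)[0][1]
        errors += len(members) - majority_count
    return errors / len(true_labels)
```

For example, with true labels `[0, 0, 0, 1, 1, 1]` and estimated labels `[5, 5, 7, 7, 7, 7]`, the estimated cluster 7 contains one user from true cluster 0 and three from true cluster 1, so exactly one of the six users is counted as an error.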

Fix $m = n = 1200$ and let $b = 4$ or $10$. Figure 2 shows the performance of these three algorithms.
The x-axis is the number of clusters $r$ and the y-axis is the average number of comparisons $N = (1-\epsilon)\binom{m}{2}$
provided by each user. Each point on a curve shows, for the given number of clusters $r$, the smallest $N$
such that the average fraction of misclustered users of an algorithm over 50 experiments is less than 5%,
in which case we say the algorithm succeeds. In other words, the algorithms succeed in the parameter
regime above the corresponding curves.

Compared to the algorithms using net-win vectors, the standard spectral clustering algorithm has very
poor performance. As we explained in Section IV, the underlying reason is that the deviation $\|R_u -
\mathbb{E}[R_u]\|_2 = \Theta(m\sqrt{1-\epsilon})$, which is much larger than the distances between different clusters given by
$\Theta((1-\epsilon)m)$; the net-win vectors are much less noisy with $\|S_u - \mathbb{E}[S_u]\|_2 = O(\sqrt{(1-\epsilon)\, m \log n})$.

Furthermore, our Algorithm 1 performs better than the projected $K$-means. As we explained in Section
IV, this is due to the fact that the spectral clustering step in our algorithm further de-noises the net-win
vectors by projecting them onto the space spanned by the top $r$ right singular vectors of $S$. Notice that
the case $b = 10$ is not covered by our theorems, but Figure 2 shows that the clustering algorithms have
similar or even better performance in this case.


B. Estimating the number of clusters r

In practice, the number of user clusters $r$ is usually not known a priori. One way to get around this
difficulty is to first guess the number of clusters $\tilde r$ and then apply our algorithm. In the experiment, we first
cluster the rows of $\hat S$ using the $K$-means algorithm for each $\tilde r$ and then apply the maximum likelihood
estimation for the score vector in each cluster.

We fix $m = n = 120$, $b = 5$ and $\epsilon = 0.95$. Figure 3 shows the simulation results for $r = 1, 2, 4$ and $8$.
For each $r$, the blue curve shows how the relative error changes with $\tilde r$. When $\tilde r$ is smaller than
$r$, two or more true clusters are assigned to one cluster and the error in $\hat\theta$ is large. When $\tilde r$ is equal to or
slightly larger than $r$, the estimate $\hat\theta$ approximates $\theta$ quite well, as each cluster returned by our clustering
algorithm consists mainly of users from one true cluster. In particular, in the first plot where there
is only one cluster, the relative estimation error does not grow much even for $\tilde r = 6$. However, when $\tilde r$
is too large, there will be many small clusters and the variance in $\hat\theta$ can be very large, which could also
result in a large estimation error.

If we view $\hat\theta$ as a function of $\tilde r$, the red curve shows how the change of $\hat\theta$ in $\tilde r$, i.e., $\|\hat\theta(\tilde r) - \hat\theta(\tilde r - 1)\|_2$,
changes with $\tilde r$. For comparison purposes, we normalize this difference by $\|\theta\|_2$. From the experiment,
a good heuristic for identifying the number of clusters $r$ is to look for the $\tilde r$ such that the change
$\|\hat\theta(\tilde r) - \hat\theta(\tilde r - 1)\|_2$ is minimized.


VIII. Conclusions

This paper studies the problem of clustering and ranking items with pairwise comparisons obtained
from multiple types of users. The key idea is that projecting the comparison vectors onto a particular
low-dimensional linear subspace significantly reduces the noise and improves the clustering performance;
the projection can be efficiently computed by calculating the net-win vectors for each user. Our proofs
require $b \in [b_0, 5]$ to show that the projection preserves the distances between clusters, while the experiments
indicate that the means of the net-win vectors $\bar S_k$ for different clusters are well separated for any $b > b_0$.
An interesting direction for future work is to prove that this result indeed holds for a wide range of $b$. Also, under
a deterministic sampling model where the set of observed entries of the comparison matrix $R$ is fixed,
the net-win vectors are known to be sufficient statistics for estimating the score vector under the BT
model if the clusters are known; it would be interesting to prove similar results when the clusters are unknown.
Appendix

Note that $AA^T = mI - \mathbf{1}\mathbf{1}^T = L_m$, which is the Laplacian of the complete graph. It is easy to check
that the eigenvalues of $L_m$ are $0$ and $m$, and the eigenvector corresponding to the zero eigenvalue is given
by $\frac{1}{\sqrt{m}}\mathbf{1}$. Therefore, $A$ is of rank $m-1$ and all its nonzero singular values are $\sqrt{m}$. Moreover, the SVD
of $A$ is $A = \sqrt{m}\, U V^T$ and $[U, \frac{1}{\sqrt{m}}\mathbf{1}]$ is an orthogonal matrix. Let $U_i$ be the $i$-th row of $U$; then
$\|U_i\|_2 = \sqrt{(m-1)/m}$ for $i = 1, \ldots, m$. For $1 \le i < j \le m$, let $V_{ij}$ denote the $(i,j)$-th row of $V$
and $(A^T U)_{ij}$ denote the $(i,j)$-th row of $A^T U$. Since $V = A^T U / \sqrt{m}$, it follows that $V_{ij} = \frac{1}{\sqrt{m}}(U_i - U_j)$
and

$$\|V_{ij}\|_2 = \sqrt{\frac{2}{m}}.$$


By definition, R^Jj = ^rf(0 u ,t, — 0 u j). The function f(x) is nonlinear and can be approximated by the 
linear function x/2 when x is close to 0. Since \9ki — 9 k j\ < b for all i,j, the maximum approximation 
error is given by 5(b) = \ f(b) — ||. 

By definition, for any k, 

A A-^nejA)vi y T 

+ ^(Ma) - \e T k A)vu\ 

Then, the difference between ,5'/. and Sy is lower bounded by 


\\Sk - Sk'lU 

> l -^Vm\\(e k -e k ,)u\\ 2 - 1 -^ 


\\( f (6lA)--6lA)V\\ 2 


W(fKA) + -e T k ,A)V\\ 2 


As J2i 9k,i = 0k',i = 0, 


\\(e k - e k ,)u\\ 2 = ||( 0 fc - 0 * 0 [u, -=i]|| 2 = ||0* - 0*' 11 2 • 

\ m 


Using the fact that 


we get 




II pel A) - i elA\[,_ - Ki)‘ 

m 


i<j 


m\\d k \\ 2 2 . 
Therefore,


II s k - SV|| 2 >(1 - e)y/m -\\e k - e k >|| 2 


6(b) 


m\u + 1 M 2 ) 


First we bound ||# fc — (9/,./11 2 . Recall that (9/,. is the centered version of 0 k , and 9 ki are generated i.i.d. 
uniformly in [0, b}. When m > C\ log r, by Hoeffding’s inequality. 


EC- 


mb 


< C 2 y / mlogr, 


EC.- 


mb 


< Co \Jm log r 


with high probability. By definition, 

\\»k - S t ,\\ 2 >||#» - o“,|| 2 - ||F(EC-EOHh 

i i 

To bound \\0 k — 6 || 2 , note that 

mb 2 


eDK-^III] = e 


E<c - c.) 


Define A', = (0°, - - £. Then A', < Ir, A [A’J = 0 and 

6 4 


E [X, 2 ] —E [(«?,, - «2,,,) 4 ] - - 


=E 


c -1 


+ 6E 


ol-,) = 


k',i 


E 


nO u 

V k ',i 2 


6 4 

36 


6 4 6 4 6 4 6 4 6 4 

<— + — +-< — . 

“80 6 80 36 “ 6 

By Bernstein’s inequality, when m > C : > logr, with high probability, 

mb 2 




6 


< Ci \Jm log rb 2 


When m is large enough, we have \\9 k — 9 k t\\ 2 > y/0.9mb 2 /6 with high probability. 

Next we bound ||0 fe || 2 . By definition, ||6* fc || 2 < ||6^|| 2 . Note that E [||#°|||] = mb 2 /12. Following the 
similar argument as above, we can show that, when m > C 5 logr, ||0 fc || 2 < ||0°|| 2 < ^I.lmb 2 /12 with 
high probability. 

Combining the two inequalities, we get 


>(7(1 — e)m, 


6(1 — e)m 


where the last inequality holds because 6 e [6 0 ,5], 6(b)/b increases with 6 and \J^§- > y^r^y-- 
The assumption that $b$ is large implies that $|\theta_{k,i} - \theta_{k,j}|$ is large for any $k$ and $i < j$. To show this, we
note that

p[| Ok,i-e kJ \ < 1 ] < jj. 

Then by union bound we get 


p \4k,i < j, \e k ,i - e kJ \ > i] > 1 - ^ > i - . 

b log m 


( 2 ) 


In the following we will assume 9 k ,i ^ 9 k ,j for i ^ j. By definition, S k = ^f(9 k A)VU T . Define 
Vk,ij = >6k j — <e k } t0 1 * 1C signed indicator variable of the order between 9 kji and 9 k j. Then 

P* - s*.|| J-^i\\(j(e k A) - f(e k ,A))v || 2 

> [life - w)v \\2 - ||(/(M)- m)v\\ 2 

-\\(f(O t 'A)- m ,)V\\ 2 

>-—- illto* - ’W) v b - ll/tot-4) - ’(tlb 


\\f{9 k 'A) — Tj k ' 112 


First we show that \\f(9 k A) — 77 ^.11 2 < Ciy/rn. When 1 9 ki — 9 kj | > t, 

I f(O k ,i ~ 9 kJ ) - r) kiij \ < < 2e“*. 

According to ([2]), for any integer 1 < t < m, there are m — t pairs of 9 k and 0 kl separated at least by 
t. Therefore, 


II f{9 k A) - r) k \\l < ^(m - t)4e 2t < C-ym. 

t =1 


We bound || f[9 k 'A) — rj k i || 2 similarly. 

Next we show that $\|(\eta_k - \eta_{k'})V\|_2 \ge C_2 m$. Observe that

$$\|(\eta_k - \eta_{k'})V\|_2^2 = \|\eta_k V\|_2^2 + \|\eta_{k'} V\|_2^2 - 2\eta_k V V^T \eta_{k'}^T = \frac{2}{3}(m^2 - 1) - \frac{2}{m} \eta_k A^T A \eta_{k'}^T, \qquad (3)$$

where the second equality follows from Lemma 1. By definition of $A$, $(\eta_k A^T)_i$ represents the number of
$\theta_j$ that are smaller than $\theta_i$ minus the number of $\theta_j$ that are larger than $\theta_i$. Therefore, $\eta_k A^T$ and $\eta_{k'} A^T$ are
independent random permutations of the deterministic vector $[-(m-1), -(m-3), \ldots, m-3, m-1]$.
Without loss of generality, assume $\eta_k A^T = [-(m-1), -(m-3), \ldots, m-3, m-1]$ and denote $\eta_{k'} A^T$
by $x$, which is a random permutation of $\eta_k A^T$. Let $Z = \eta_k A^T A \eta_{k'}^T = -(m-1)x_1 + \cdots + (m-1)x_m$ and
define the martingale $y_i = \mathbb{E}[Z \mid x_1, \ldots, x_i]$. In particular, we have $y_0 = \mathbb{E}[Z] = 0$ and $y_m = Z$. We also
note that $|y_{i+1} - y_i| \le 2m^2$. By Azuma's inequality,

$$\mathbb{P}[|Z| \ge t] = \mathbb{P}[|y_m - y_0| \ge t] \le 2\exp\left(-\frac{t^2}{8m^5}\right),$$

and thus $|Z| \le C_3\, m^{5/2} \log^{1/2} m$ with high probability. Plugging it into (3), we get $\|(\eta_k - \eta_{k'})V\|_2 \ge C_2 m$.
Combining the above two steps, we conclude that

$$\|\bar S_k - \bar S_{k'}\| \ge C(1-\epsilon)m.$$











